MATHEMATICAL ENGINEERING TECHNICAL REPORTS An Asymptotically Optimal Policy for Finite Support Models in the Multiarmed Bandit Problem

Authors

  • Junya HONDA
  • Akimichi TAKEMURA
Abstract

We propose the minimum empirical divergence (MED) policy for the multiarmed bandit problem and prove its asymptotic optimality for the case of finite support models. In our setting, Burnetas and Katehakis [3] have already proposed an asymptotically optimal policy. For choosing an arm, our policy uses a criterion that is dual to the quantity used in [3]. Our criterion is easily computed by a convex optimization technique and has an advantage in practical implementation. We confirm by simulations that the MED policy demonstrates good finite-time performance in comparison to other currently popular policies.
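For finite (or bounded) support models, the kind of divergence minimization the abstract refers to reduces to a one-dimensional concave maximization in its dual form, which is why it is cheap to compute. The following is an illustrative sketch only, not the paper's implementation: the function name, the grid-search solver, and the assumption of support in [0, 1] are our own choices. It computes the minimum KL divergence from an empirical distribution to any distribution on the interval whose mean is at least a target value, via the well-known scalar dual.

```python
import numpy as np

def min_empirical_divergence(values, counts, mu, support_max=1.0):
    """Minimum KL divergence D(F_hat || G) over distributions G on
    [min(values), support_max] with mean E_G[X] >= mu, where F_hat is the
    empirical distribution given by (values, counts).

    Uses the scalar dual representation
        D_min = max_{0 <= nu <= 1/(support_max - mu)} E_F[log(1 - nu (X - mu))],
    which is concave in nu, so a dense grid search suffices for a sketch.
    Assumes mu < support_max.
    """
    values = np.asarray(values, dtype=float)
    probs = np.asarray(counts, dtype=float)
    probs = probs / probs.sum()
    if values @ probs >= mu:
        return 0.0  # empirical mean already attains the target: divergence 0
    assert mu < support_max, "target mean must lie below the support bound"
    hi = 1.0 / (support_max - mu)
    # Drop the right endpoint to avoid log(0) when a value equals support_max.
    nus = np.linspace(0.0, hi, 10001)[:-1]
    # Objective E_F[log(1 - nu (X - mu))] evaluated on the whole grid at once.
    obj = (probs[None, :] * np.log1p(-np.outer(nus, values - mu))).sum(axis=1)
    return float(obj.max())
```

For example, with a single observation at 0.2 and target mean 0.5, the optimal G moves mass to the upper support endpoint, giving divergence log(1.6); for an empirical Bernoulli(0.5) sample and target 0.8, the value coincides with the Bernoulli KL divergence kl(0.5, 0.8).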


Related articles

MATHEMATICAL ENGINEERING TECHNICAL REPORTS Finite-time Regret Bound of a Bandit Algorithm for the Semi-bounded Support Model

In this paper we consider stochastic multiarmed bandit problems. Recently a policy, DMED, was proposed and proved to achieve the asymptotic bound for the model in which each reward distribution is supported in a known bounded interval, e.g. [0, 1]. However, the derived regret bound is stated in an asymptotic form, and the finite-time performance has been unknown. We inspect this policy and deri...


An Asymptotically Optimal Bandit Algorithm for Bounded Support Models

The multiarmed bandit problem is a typical example of the dilemma between exploration and exploitation in reinforcement learning, expressed as a model of a gambler playing a slot machine with multiple arms. We study the stochastic bandit problem where each arm has a reward distribution supported in a known bounded interval, e.g. [0, 1]. In this model, Auer et al. (2002) proposed practical...


On the efficiency of Bayesian bandit algorithms from a frequentist point of view

In this contribution, we argue that algorithms derived from the Bayesian modelling of the multiarmed bandit problem are also optimal when evaluated using the frequentist cumulated regret as a measure of performance. We first show that the classical Gittins argument can be applied to convert the finite-horizon Bayesian multiarmed bandit problem into an MDP planning task that is numerically solva...


Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching - Automatic Control, IEEE Transactions on

We consider multiarmed bandit problems with switching cost, define uniformly good allocation rules, and restrict attention to such rules. We present a lower bound on the asymptotic performance of uniformly good allocation rules and construct an allocation scheme that achieves the bound. We discover that despite the inclusion of a switching cost the proposed allocation scheme achieves the same a...


Optimal Policies for a Class of Restless Multiarmed Bandit Scheduling Problems with Applications to Sensor Management

Consider the Markov decision problems (MDPs) arising in the areas of intelligence, surveillance, and reconnaissance in which one selects among different targets for observation so as to track their position and classify them from noisy data [9], [10]; medicine in which one selects among different regimens to treat a patient [1]; and computer network security in which one selects different compu...



Journal:

Volume   Issue 

Pages  -

Publication date: 2009